Geographical analysis of media flows

A multidimensional approach

Claude Grasland (Université de Paris (Diderot), FR 2007 CIST, UMR 8504 Géographie-cités)

featured

Introduction

1 Corpus preparation

The aim of this section is to prepare a corpus of news related to a language and one or several countries over a period of time. As an example, we will try to prepare a corpus of news in french (fr) related to France (FRA), Belgium (BEL) and Algeria (DZA) over a period of 2 years from 1st Jan 2014 to 31th December 2015. As news include not only titles but also descriptions, we decide to break the descriptions in sentences and keep a maximum of 2 sentences by news, with a maximum of 100 tokens by sentence.

The data used in this example has been collected by the research project ANR Geomedia and are free to use for scientific and pedagogical purpose only. The content of the news should not be used or disseminated without the agreement of the newspapers.

1.1 Selection of media

We import the data provided by each media and put them in a single data.frame. Then we select the columns of interest

# Load data with the function fread (fast) and the encoding UTF-8
  df1<-fread("data/source/fr_FRA_figaro_int.csv", encoding = "UTF-8")
  df1$media<-"fr_FRA_figaro"
  
  
  df2<-fread("data/source/fr_FRA_libera_int.csv", encoding = "UTF-8")
  df2$media<-"fr_FRA_libera"
  
  df3<-fread("data/source/fr_BEL_derheu_int.csv", encoding = "UTF-8")
  df3$media<-"fr_BEL_derheu"
  
  df4<-fread("data/source/fr_BEL_lesoir_int.csv", encoding = "UTF-8")
  df4$media<-"fr_BEL_lesoir"
  
  df5<-fread("data/source/fr_DZA_elwata_int.csv", encoding = "UTF-8")
  df5$media<-"fr_DZA_elwata"
  
  df6<-fread("data/source/fr_DZA_xpress_int.csv", encoding = "UTF-8")
  df6$media<-"fr_DZA_xpress"
  
  # transform in data.table format
  df<-rbind(df1,df2,df3,df4,df4,df5,df6)
  rm(df1,df2,df3,df4,df5,df6)
  
  
  # select column of interest
  df$id <- df$ID_Item
  df$who <- df$media
  df$when <- df$Date_Recup
  df$text <- paste(df$Titre," . ", df$Description, sep="")
  df<-df[,c("id","who","when","text")]
  df<-df[order(when),]
  
  # select period of interest
  mintime<-as.Date("2014-01-01")
  maxtime<-as.Date("2015-12-31")
  df<-df[(is.na(df$when)==F),] 
  df<-df[as.Date(df$when) >= mintime,]
  df<-df[as.Date(df$when) <= maxtime,]
  
  # eliminate duplicate
  df<-df[duplicated(df$text)==F,]

1.2 Check of time frequency

1.2.1 Time divisions

We transform the previous data.frame in a data.table format for easier operations of aggregation

dt<-as.data.table(df)
  dt$day     <- as.Date(dt$when)
  dt$week    <- cut(dt$day, "weeks", start.on.monday=TRUE)
  dt$month   <- cut(dt$day, "months")
  dt$weekday <- weekdays(dt$day)
  
  # Save data frame
  saveRDS(dt,"data/corpus/dt_mycorpus.RDS") 

1.2.2 News by week

We examine if the distribution is regular by week for the different media of the corpus.

dt<-readRDS("data/corpus/dt_mycorpus.RDS")
  news_weeks<-dt[,.(newstot=.N),by=.(week,who)]
  
  p<-ggplot(news_weeks, aes(x=as.Date(week),y=newstot, col=who))+
     geom_line()+
     geom_smooth(method = 'loess', formula = 'y~x')+
     scale_y_continuous("Number of news", limits = c(0,NA)) +
     scale_x_date("Week (starting on monday)") +
           ggtitle(label ="Corpus : distribution of news by week",
                    subtitle = "1st Jan 2014 to 31th Dec.  2015")
  p

1.2.3 News by weekday

We examine if the distribution is regular by weekday and check in particular the effect of the week-end.

#compute frequencies by weekday
  news_weekdays<-dt[,.(newstot=.N),by=.(weekday,who)]
  news_weekdays<-news_weekdays[,.(weekday,newspct=100*newstot/sum(newstot)),by=.(who)]
  
  
  # Translate weekdays in english and order
  news_weekdays$weekday<-as.factor(news_weekdays$weekday)
  levels(news_weekdays$weekday)<-c("7.Sunday","4.Wednesday","1.Monday","2.Tuesday","3.Thursday","6.Sathurday","5.Friday")
  news_weekdays$wkd<-as.factor(as.character(news_weekdays$weekday))
  news_weekdays<-news_weekdays[order(news_weekdays$weekday),]
  
  
  p<-ggplot(news_weekdays, aes(x=weekday,fill = who, y=newspct))+
           geom_bar(position = "dodge", stat="identity")+
           scale_y_continuous("Share of news (%)", limits = c(0,NA)) +
           ggtitle(label ="Corpus : distribution of news by week day",
                    subtitle = "1st Jan 2014 to 31th Dec.  2015")
  p

1.3 Transform in quanteda corpus

1.3.1 Reshape news by sentences

The aim of this step is to harmonize the length of texts collected through rss. We decide to keep only the title of news and the two first sentences of descriptions when they are available. The result is stored in quanteda format.

Unfortunately, the division in text is sentences realized by quanteda is far from perfect which is due to problems in the collection of news. For example, the following text will be considered as a single sentence because the point is not followed by a blank character.

Le conflit est terminé.Mais la Russie est-elle d’accord avec la Turquie.

It is necessary to add a regular expression for the cleaning of text and the inclusion of a blank space " " after each point located after a lower case character and before an upper case character :

str_replace_all(txt,“(?<=[:lower:])\.(?=[:upper:])”, “\.”)

In order to obtain a text that will be recognised as made of 2 sentences.

Le conflit est terminé. Mais la Russie est-elle d’accord avec la Turquie.

Some sentences appears too short or too long for a sound analysis. Therefore, we decide to eliminate outliers based on the quantile of the numbe of tokens. In practice we decide to eliminate the sentences with more than 100 tokens or less than 3 tokensr

t1<-Sys.time()
  dt<-readRDS("data/corpus/dt_mycorpus.RDS")
  
  
  
  # clean sentences break (long !)
  dt$text<-str_replace_all(dt$text,"(?<=[:lower:])\\.(?=[:upper:])", "\\. ")
  
  
  # transform in quanteda
  qd<-corpus(dt,docid_field = "id",text_field = "text")
  
  
  
  # break in sentences
  qd<-corpus_reshape(qd,to="sentences", use_docvars=T)
  
  # Identify rank of sentences
  qd$order<-as.numeric(as.data.frame(str_split(names(qd),"\\.", simplify=T))[,2])
  
  # Select only title + maximum of 3 sentences
  qd<-corpus_subset(qd, order < 5)
  
  # filter by number of tokens by sentence
  qd$nbt<-ntoken(texts(qd))
  #mintok<-quantile(qd$nbt,0.01)
  #maxtok<-quantile(qd$nbt,0.99)
  #qd<-corpus_subset(qd, nbt>mintok)
  qd<-corpus_subset(qd, nbt<100)
  qd<-corpus_subset(qd, nbt>2)
  
  
  
  # Save corpus in qd format
  saveRDS(qd,"data/corpus/qd_mycorpus.RDS")
  
  t2<-Sys.time()
  paste("Program executed in ", t2-t1)
  
  head(qd)
  summary(qd,3)

1.3.2 Number of sentences by media

We check the number of sentences available by title (1) and order of sentences in description (2 to 5)

qd<-readRDS("data/corpus/qd_mycorpus.RDS")
  x<-data.table(docvars(qd))
  
  
  tab<-x[,.(tot=.N),by=.(who,order)]
  tab<-dcast(tab,order~who)
  tab$order<-as.factor(tab$order)
  levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
  kable(tab, caption = "Distribution of title and sentences by media")
Distribution of title and sentences by media
order fr_BEL_derheu fr_BEL_lesoir fr_DZA_elwata fr_DZA_xpress fr_FRA_figaro fr_FRA_libera
Title 6994 10815 2896 4794 9449 13703
Sent1 6962 10730 2925 4755 9423 11227
Sent2 1884 3517 2893 249 3591 3453
Sent3 518 995 2867 16 536 444

1.3.3 Size of texts by month

We visualize the distribution of sentences of different order through time in order to prepare a decision on the length of text to be kept.

tab<-x[,.(tot=.N),by=.(month,order)]
  tab$month<-as.Date(tab$month)
  tab$order<-as.factor(tab$order)
  levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
  
         
         p<-ggplot(tab, aes(x=month,fill = order, y=tot))+
           geom_bar(stat="identity")+
           ggtitle(label ="Corpus : distribution of titles and sentences by month",
                    subtitle = "1st Jan 2014 to 31th Dec.  2015")
  p

2 Geographical tags

The aim of this section is to add to the quanteda corpus different metadata related to the geographical entities that are mentioned in the news. We do not discuss here the problems related to the choice of a list of entities and we just apply a method of recognition based on a dictionary. It is theoretically possible to recognize a great number of spatial entities (regions, continents, cities, …) put we will limit here our research to the case of states recognized at UN and some adding territories partly recognized like Kosovo, Northern Cyprus or Taiwan.

2.1 Preparation of data

2.1.1 Load dictonary

We start by loading the last version of the Imageun dictionary and we extract our target language (here : french).

# Load multilanguage dictionary
  dict<-fread("data/dico_states/global_state_V2.csv")
  
  # Select french dictionary
  dict <- dict[dict$lang=="fr",]
  
  
  head(dict)
   ISO3           x lang
  1:  ABK     abkhaz*   fr
  2:  AFG     afghan*   fr
  3:  AFG      kaboul   fr
  4:  AFG afghanistan   fr
  5:  AGO      angola   fr
  6:  AGO   angolais*   fr

2.1.2 Load corpus

qd <- readRDS("data/corpus/qd_mycorpus.RDS")

2.1.3 Load tagging function

extract_tags <- function(qd = qd,                      # the corpus of interest
                           lang = "fr",                  # the language to be used
                           dict = dict,                  # the dictionary of target 
                           code = "ISO3" ,                # variable used for coding
                           alias = "x",                   # variable used for alias
                           tagsname = "states",           # name of the tags column
                           split  = c("'","’","-"),       # split list
                           tolow = TRUE  ,                # Tokenize text
                           comps = c("Afrique du sud")  # compounds
                           )
  { 
  
  
    
  # Tokenize  
  x<-as.character(qd)
  
  
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         x <- gsub(reg," ",x)}  
  if(tolow) { x <- tolower(x)} 
  toks<-tokens(x)
  
  # compounds
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         comps<- gsub(reg," ",comps)}  
  if(tolow)       {comps <- tolower(comps)}  
  toks<-tokens_compound(toks,pattern=phrase(comps))
  
    
  # Load dictionaries and create compounds
  
    ## Target dictionary
  
  labels <-dict[[alias]]
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         labels<- gsub(reg," ",labels)}  
  if(tolow)       {labels <- tolower(labels)}  
  toks<-tokens_compound(toks,pattern=phrase(labels))
    
   # create quanteda dictionary
  keys <-gsub(" ","_",labels)
  qd_dict<-as.list(keys)
  names(qd_dict)<-dict[[code]]
  qd_dict<-dictionary(qd_dict,tolower = FALSE)
  
  # Identify geo tags (states or reg or org ...)
  toks_tags <- tokens_lookup(toks, qd_dict, case_insensitive = F)
  toks_tags <- lapply(toks_tags, unique)
  toks_tags<-as.tokens(toks_tags)
  list_tags<-function(x){res<-paste(x, collapse=' ')}
  docvars(qd)[[tagsname]]<-as.character(lapply(toks_tags,FUN=list_tags))
  docvars(qd)[[paste("nb",tagsname,sep="")]]<-ntoken(toks_tags)
  
  
  
  # Export results
  return(qd)
   }

2.2 Geographical annotation

2.2.1 Annotate all entities

In a first step, we annotate all geographic entities together in order to benefit from the cross-definition of their respective compounds. We will separate them by subcategories in a second step.

t1<-Sys.time()
  
  frcomps<-c("mer de Chine", "océan indien", "continent américain", "pays américains")
  
  qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict,
                       code = "ISO3",
                       alias = "x",
                       tagsname = "states",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = TRUE)
  
  t2 = Sys.time()
  paste("Program executed in ", t2-t1)
  
  table(qd$nbstates)

2.2.2 check news with maximum state number

table(qd$nbstates)

      0     1     2     3     4     5     6     7     8     9    10    11    12 
  52314 48299 12827  1749   310    74    33    17     6     1     1     2     1 
     13    19 
      1     1 
check<-corpus_subset(qd,nbstates>9)
  x<-data.frame(who=check$who,when = check$when,text=as.character(check),states=check$states,nbstates=check$nbstates)
  x<-x[order(x$nbstates,decreasing = T),]
  kable(x)
who when text states nbstates
9845355 fr_DZA_elwata 2015-10-02 11:31:33 Iran: 464 morts, Egypte : 75, Nigeria : 64, Mali : 60, Indonésie : 57, Inde : 45 morts, Pakistan : 46, Niger : 22, Cameroun : 20, Côte d’Ivoire : 14, Tchad : 11, Algérie : 11, Sénégal : 10, Maroc : 10, Somalie : 8, Libye : 4, Tanzanie : 4, Kenya : 3, Tunisie : 2 et Burkina Faso, Burundi et Pays-Bas : 1 mort chacun. IRN EGY NGA MLI IND PAK NER CMR CIV TCD MAR SOM LBY TZA KEN TUN BFA BDI NLD 19
148030 fr_DZA_elwata 2014-01-13 11:47:24 La réunion des ministres des Affaires étrangères des onze (Grande-Bretagne, Allemagne, Italie, France, Arabie Saoudite, Emirats arabes unis, Qatar, Egypte, Jordanie, Etats-Unis, Turquie) s’est ouverte, hier à Paris, en présence du président de la Coalition, Ahmad Jarba, pour discuter de la participation toujours incertaine de l’opposition à la conférence de paix sur la Syrie à partir du 22 janvier en Suisse. GBR DEU ITA FRA SAU ARE QAT EGY JOR USA TUR SYR CHE 13
131590 fr_DZA_elwata 2014-01-10 10:56:30 Les 11 pays du groupe (Etats-Unis, Grande-Bretagne, France, Allemagne, Italie, Turquie, Egypte, Jordanie, Qatar, Arabie Saoudite et Emirats arabes unis) discuteront de questions relatives à la Syrie avant la conférence de Genève 2, prévue pour le 22 janvier prochain, a indiqué la porte-parole du département d’Etat, Jen Psaki. USA GBR FRA DEU ITA TUR EGY JOR QAT SAU ARE SYR 12
4252357 fr_DZA_elwata 2015-01-11 10:17:19 Seront présents les ministres allemand, autrichien, belge, britannique, danois, espagnol, italien, letton, néerlandais, polonais et suédois. DEU AUT BEL GBR DNK ESP ITA LVA NLD POL SWE 11
7377071 fr_DZA_elwata 2015-06-10 11:57:29 Durant cette rencontre, il est attendu environ 70 000 visiteurs professionnels venus des pays suivants : Pérou, France, Pays-Bas, Etats-Unis, Chine,  Hongrie, Espagne, Croatie, Russie, Danemark, Japon et Allemagne. FRA NLD USA CHN HUN ESP HRV RUS DNK JPN DEU 11
7360024 fr_DZA_elwata 2015-06-09 17:57:27 L’Algérie, l’Egypte, la Tunisie, le Maroc, les 5+5 de la rive nord de la Méditerranée (Espagne, Italie, France, Portugal, Malte) et les Etats-Unis mènent depuis le mois de mars un embargo maritime dans les ports occupés par l’organisation de l’Etat islamique en Libye. EGY TUN MAR ESP ITA FRA PRT MLT USA LBY 10

2.2.3 Save geographically anotated corpus

saveRDS(qd,"data/corpus/qd_mycorpus_states.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/qd_mycorpus_states.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  5.984 Mo"

3 Thematic tags

The aim of this section is to add to the quanteda corpus different metadata related to “thematic” tags i.e. topics that can be considered as related to the same same types of events but without considering space and time. In other words a thematic tag should not be related to an isolated crisis or an isolated country but to something more general that could take place anytime and anywhere. As an example, the earthquake from Nepal in April 2015 is not a topic but an event which belongs to the general category of news about seismic risk. The Ebola outbreak from western Africa is different because it is not an isolated event but a group of events spreading through space and time. It can therefore be considered as a sub-topic of the more general topic of pandemics. Finally the so-called “migrant crisis” that occurred in Europe in 2015 is certainly a mega-event but it is probably not a good idea to choose it as a topic. It is more interesting in this case to look for all news associated to the topic of “human mobility accross borders” and to search for news speaking from migrant, refugges or asylum seekers in all countries and at all periods of time. The problem of topic’s definition is therefore firstly a conceptual problem and only in a second step a question of keywords and dictionary. It the topic is not well defined, iy will not be possible to propose a consistent and coherent method of identification of news related to the topic.

The thematic tags will be added to the corpus previously annotated by geographical units.

qd <-readRDS("data/corpus/qd_mycorpus_states.RDS")

3.1 Load tagging function

We can use exactly the same program of tagging than the one we have used previously for the tagging of geographical entities. The only condition is to prepare a dataframe with three columns

  • lang : the language associated to the dictionary
  • code : the tag to be used
  • label: the words or groups of words to be recognized
extract_tags <- function(qd = qd,                      # the corpus of interest
                           lang = "fr",                  # the language to be used
                           dict = dict,                  # the dictionary of target 
                           code = "code" ,                # variable used for coding
                           alias = "alias",               # variable used for alias
                           tagsname = "states",           # name of the tags column
                           split  = c("'","’","-"),       # split list
                           tolow = TRUE  ,                # Tokenize text
                           comps = c("Afrique du sud")  # compounds
                           )
  { 
  
  
    
  # Tokenize  
  x<-as.character(qd)
  
  
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         x <- gsub(reg," ",x)}  
  if(tolow) { x <- tolower(x)} 
  toks<-tokens(x)
  
  # compounds
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         comps<- gsub(reg," ",comps)}  
  if(tolow)       {comps <- tolower(comps)}  
  toks<-tokens_compound(toks,pattern=phrase(comps))
  
    
  # Load dictionaries and create compounds
  
    ## Target dictionary
  
  labels <-dict[[alias]]
  if(length(split) > 0) { reg<-paste(split, collapse = '|')
                         labels<- gsub(reg," ",labels)}  
  if(tolow)       {labels <- tolower(labels)}  
  toks<-tokens_compound(toks,pattern=phrase(labels))
    
   # create quanteda dictionary
  keys <-gsub(" ","_",labels)
  qd_dict<-as.list(keys)
  names(qd_dict)<-dict[[code]]
  qd_dict<-dictionary(qd_dict,tolower = FALSE)
  
  # Identify geo tags (states or reg or org ...)
  toks_tags <- tokens_lookup(toks, qd_dict, case_insensitive = F)
  toks_tags <- lapply(toks_tags, unique)
  toks_tags<-as.tokens(toks_tags)
  list_tags<-function(x){res<-paste(x, collapse=' ')}
  docvars(qd)[[tagsname]]<-as.character(lapply(toks_tags,FUN=list_tags))
  docvars(qd)[[paste("nb",tagsname,sep="")]]<-ntoken(toks_tags)
  
  
  
  # Export results
  return(qd)
   }

3.2 The pandemic topic

3.2.1 Dictionary

We decide here to use lower case transformation. We use a star for the words that can take a plural form.

label <- c("épidémie*", "pandémie*", "virus", "oms", "ébola", "ebola",  "h1n1","sras", "chikungunya", "choléra", "peste")
  code  <- rep("pand", length(label))
  lang  <- rep("fr", length(label))
  dict_pande <- data.frame(code,lang,label)
  kable(dict_pande)
code lang label
pand fr épidémie*
pand fr pandémie*
pand fr virus
pand fr oms
pand fr ébola
pand fr ebola
pand fr h1n1
pand fr sras
pand fr chikungunya
pand fr choléra
pand fr peste
frcomps<-c("virus informatique")

3.2.2 Annotation

qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict_pande,
                       code = "code",
                       alias = "label",
                      tagsname = "pand",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = TRUE)
  
  table(qd$nbpand)

3.2.3 Visualization

x<-data.table(docvars(qd))
  x$tag<-x$nbpand !=0
  tab<-x[,.(tot=.N),by=.(month,tag, who)]
  tab<-tab[tab$tag==TRUE,]
  tab$month<-as.Date(tab$month)
  
         
         p<-ggplot(tab, aes(x=month,fill =who, y=tot))+
           geom_bar(stat="identity")+
           ggtitle(label ="Pandemic : distribution of tags by month and media",
                    subtitle = "1st Jan 2014 to 31th Dec.  2015")
  p

3.3 The quake topic

3.3.1 Dictionary

We decide here to use lower case transformation. We use a star for the words that can take a plural form.

label <- c("séisme*", "sismiqu*", "tremblement de terre", "tremblements de terre", "tsunami*", "USGS","échelle de Richter", "secousse tellurique")
  code  <- rep("quak", length(label))
  lang  <- rep("fr", length(label))
  dict_quak <- data.frame(code,lang,label)
  kable(dict_quak)
code lang label
quak fr séisme*
quak fr sismiqu*
quak fr tremblement de terre
quak fr tremblements de terre
quak fr tsunami*
quak fr USGS
quak fr échelle de Richter
quak fr secousse tellurique
frcomps<-c("séisme politique")

3.3.2 Annotation

qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict_quak,
                       code = "code",
                       alias = "label",
                       tagsname = "quak",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = TRUE)
  
  table(qd$nbquak)

3.3.3 Visualization

x<-data.table(docvars(qd))
  x$tag<-x$nbquak !=0
  tab<-x[,.(tot=.N),by=.(month,tag, who)]
  tab<-tab[tab$tag==TRUE,]
  tab$month<-as.Date(tab$month)
  
         
         p<-ggplot(tab, aes(x=month,fill =who, y=tot))+
           geom_bar(stat="identity")+
           ggtitle(label ="Earthquakes : distribution of tags by month and media",
                    subtitle = "1st Jan 2014 to 31th Dec.  2015")
  p

3.4 The Migrant & Refugees topic

3.4.1 Dictionary

In this last example, we introduce two subtopics corresponding to migrants and refugees or asylum seeker.The objective is to analyze a possibility of semantic shift.

label <- c("migrant*", "émigrant*", "immigrant*", "migrat*", "réfugié*", "demandeur d'asile")
  code  <- c("migr","migr","migr","migr", "refu","refu")
  lang  <- rep("fr", length(label))
  dict_mobil <- data.frame(code,lang,label)
  kable(dict_mobil)
code lang label
migr fr migrant*
migr fr émigrant*
migr fr immigrant*
migr fr migrat*
refu fr réfugié*
refu fr demandeur d’asile
frcomps<-c("oiseaux migrateurs")

3.4.2 Annotation

qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict_mobil,
                       code = "code",
                      alias = "label",
                      tagsname = "mobil",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = TRUE)
  
  table(qd$mobil)

3.4.3 Visualization

x<-data.table(docvars(qd))
  x$tag<-as.factor(x$mobil)
  levels(x$tag)<-c(NA,"Migrants","Migrants & Refugies", "Refugiés","Migrants & Refugies" )
  tab<-x[,.(tot=.N),by=.(month,tag)]
  tab<-tab[is.na(tab$tag)==FALSE,]
  tab$month<-as.Date(tab$month)
  
         
         p<-ggplot(tab, aes(x=month,fill =tag, y=tot))+
           geom_bar(stat="identity")+
           ggtitle(label ="Mobility : distribution of tags by month and qualification",
                    subtitle = "1st Jan 2014 to 31th Dec.  2015")
  p

3.5 Store results

We store the quanteda file which combine geographical and thematic tags :

saveRDS(qd,"data/corpus/qd_mycorpus_states_topics.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/qd_mycorpus_states_topics.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  6.866 Mo"

4 Hypercubes creation

This section is based on the TELEMAC application elaborated during the H2020 projected ODYCCEUS and presented in the paper published in the journal Frontiers and available at https://analytics.huma-num.fr/Claude.Grasland/telemac/

Our objective is to elaborate an hypercube organised by different dimensions. As an example, we suppose that we are interested in the analysis of the crisis of migrant and refugees (what) in different newspapers (who), at different period of time (when) and we want to explore the locations of countries that are mentioned (where) and eventually associated together (where1.where2). Finally we want to distinguish inside the news the possible changes of results if we consider the title or the first, second and third sentences of the description (order).

4.1 Definition of dimensions

To illustrate this different options, we can look at the example of a news published by the Algerian newspaper El Watan the 16th September 2015 and divided in a title and three sentences of description.

qd<-readRDS("data/corpus/qd_mycorpus_states_topics.RDS")
  
  examp<-corpus_subset(qd,docid(qd) == 9486265)
  
  kable(paste(examp))
x
Crise des réfugiés en Europe : Vers un conseil des chefs d’Etat et de gouvernement de l’UE .
L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire.
Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP.
«C’est un problème pour l’Union européenne dans son ensemble, c’est pourquoi nous nous sommes prononcés pour la tenue la semaine prochaine d’un conseil extraordinaire de l’UE», a déclaré la chancelière allemande lors d’une conférence de presse avec son homologue autrichien Werner Faymann.

Thanks to the previous operations of geographical and topical tagging, we can propose a simplified table where the text of the news has been removed and where we keep only the information of interest for the agregation procedure.

examp$id<-as.character(docid(examp))
  dtexamp<-data.table(tidy(examp)) %>% select(id=id, order = order, who = who, when=day, what=mobil, where1 = states, where2=states)
  kable(dtexamp)
id order who when what where1 where2
9486265 1 fr_DZA_elwata 2015-09-16 refu
9486265 2 fr_DZA_elwata 2015-09-16 migr DEU AUT SVK DEU AUT SVK
9486265 3 fr_DZA_elwata 2015-09-16 refu migr BEL DEU AUT SVK BEL DEU AUT SVK
9486265 4 fr_DZA_elwata 2015-09-16 DEU AUT DEU AUT

The hypercube is the result of an aggregation of foreign news according several dimensions:

  • who : this dimension is related to the variable which describe the media outlets which published the RSS feeds. Each source is related to a codell_sss_xxxxxx where llis the language, sssis the ISO3 code of the country and xxxxxxthe name of the media. For instance, a RSS feed produced by the Algerian newspaper El Watan is identified by the code who = fr_DZA_elwata. Starting from there, it is then possible to proceed to aggregation of the data by group of languages (eg. computation of the indicators for all the French speaking newspapers) or countries (compute the indicators for all the media outlets located in Algeria).

  • when : this dimension describe the day when an article of the RSS feeds has been published, according to a reference time zone (Paris in present case). Starting from the day, the data will be further aggregated according to different period of aggregation: weeks, months, quarters or years. . For instance, by choosing to work on monthly aggregated data, the first period of observation for the news presented as example will be: when = 2015-09-01. If we choose a division in weeks, we have to decide if the week start on Sunday (default option of R) or start on Monday (option adopted in present case)

  • where1 and where2 : this dual dimension is associated to the cross-list of foreign countries detected by the country dictionary in the news. For example the second sentence of our exampple (“L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire.”) has produced a list of three places (DEU,AUT,SVK) associated to the cross-list of nine couple of places (AUT-AUT, AUT-DEU, AUT-SVK, DEU-DEU, DEU-AUT, DEU-SVK, SVK-AUT, SVK-DEU, SVK-SVK) where each couple will receive a weight of 1/9. It is important to keep in mind that the countries where the media are located (mentioned in the who dimension) should be excluded from the list if we decide to work only on foreign news.

  • what : In general, this dimension can be described as a boolean value (TRUE/FALSE) which precise if the news is associated or not to the topic of interest. For example the title and the two first sentences of our example are associated to the topic of international mobility but not the third sentence where the expected keywords has not been found. But if we have introduced subtopics, the situation is more complex because the news can be associated to different subtopics (as it was associated to different states). For example the second sentence of the description (“Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP”) is associated to 2 subtopics (refug, migr) and 4 countries (BEL, AUT, DEU, SVK). It will therefore be broken in 2 x 4 x4 = 32 pieces of information, each of them associated to a value of 1/16th.

  • order : To build the hypercube, it is possible to works on different size of text units: (order=1): the title or the first sentence, or (order = 2,3,4, ...): the title with the selected number of sentence of the description available. This parameter is important because some results, especially regarding the spatial dimension of the analysis (where) are more noticeable on longer texts. In our example, it is clear that the conclusions would be different if we had decided to focus only on the title which does not mention any country and is only associated to the subtopic of refugees.

4.2 Aggregation function

The elaboration of the hypercube is based on the crossing of all dimensions with one line for each singular combination. To do that, we have elaborated a specific function that combine all the 6 dimensions but can be easily adapted if less dimensions are needed.

#' @title create an hypercube
  #' @name hypercube
  #' @description create a network of interlinked states
  #' @param corpus a corpus of news in quanteda format
  #' @param order an order of sentences in the news
  #' @param who the source dimension
  #' @param when the time dimension
  #' @param timespan aggreation of time
  #' @param what a list of topics
  #' @param where1 a list of states
  #' @param where2  a list of states
  
  
  hypercube   <- function( corpus = qd,
                          order = "order",
                          who = "source",
                          when = "when",
                          timespan = "week",
                          what = "what",
                          where1 = "where1",
                          where2 = "where2")
  {
  
  
    
  # prepare data
  
    don<-docvars(corpus)
    
    df<-data.table(id     = docid(corpus),
                   order  = don[[order]],
                   who    = don[[who]],
                   when   = don[[when]],
                   what   = don[[what]],
                   where1 = don[[where1]],
                   where2 = don[[where2]])
  
    # adjust id
   df$id<-paste(df$id,"_",df$order,sep="")
   
  # change time span
    df$when<-as.character(cut(as.Date(df$when), timespan, start.on.monday = TRUE))
  
  # unnest where1
    df$where1[df$where1==""]<-"_no_"
    df<-unnest_tokens(df,where1,where1,to_lower=F)
    
  # unnest where2
    df$where2[df$where2==""]<-"_no_"
    df<-unnest_tokens(df,where2,where2,to_lower=F) 
    
  # unnest what
    df$what[df$what==""]<-"_no_"
    df<-unnest_tokens(df,what,what,to_lower=F) 
    
  
  
  # Compute weight of news
    newswgt<-df[,list(wgt=1/.N),list(id)]
    df <- merge(df,newswgt, by="id")
  
  
  # ------------------------ Hypercube creation --------------------#
    
    
  # Aggregate
    hc<- df[,.(tags = .N, news=sum(wgt)) ,.(order,who, when,where1,where2, what)]
    
  # Convert date to time
    hc$when<-as.Date(hc$when)
    
  # export
    return(hc)
    
  }

In order to test the function, we apply it firstly on our small example of the single news published by El Watan

hc_test<-hypercube( corpus   = examp,
                      order    = "order",
                      who      = "who",
                      when     = "when",
                      timespan = "day",
                      what     = "mobil",
                      where1   = "states",
                      where2   = "states")
  kable(hc_test)
order who when where1 where2 what tags news
1 fr_DZA_elwata 2015-09-16 no no refu 1 1.0000000
2 fr_DZA_elwata 2015-09-16 DEU DEU migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 DEU AUT migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 DEU SVK migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 AUT DEU migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 AUT AUT migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 AUT SVK migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 SVK DEU migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 SVK AUT migr 1 0.1111111
2 fr_DZA_elwata 2015-09-16 SVK SVK migr 1 0.1111111
3 fr_DZA_elwata 2015-09-16 BEL BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 BEL SVK migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 DEU SVK migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 AUT SVK migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK BEL refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK BEL migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK DEU refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK DEU migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK AUT refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK AUT migr 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK SVK refu 1 0.0312500
3 fr_DZA_elwata 2015-09-16 SVK SVK migr 1 0.0312500
4 fr_DZA_elwata 2015-09-16 DEU DEU no 1 0.2500000
4 fr_DZA_elwata 2015-09-16 DEU AUT no 1 0.2500000
4 fr_DZA_elwata 2015-09-16 AUT DEU no 1 0.2500000
4 fr_DZA_elwata 2015-09-16 AUT AUT no 1 0.2500000
  • order = 1 : the title is described by a single line because we have only one subtopic and no states mentioned. The weight of the line is 1.
  • order = 2 : the first sentence of description is characterized by one subtopic and three different states which produce 9 lines with weight of 1/9 = 0.111 news.
  • order = 3 : the second sentence of description is characterized by two subtopic and four diffeent states which produce 32 lines with weight of 1/32 = 0.031 news.
  • order = 4 : the last sentence of description is characterized by no topics and 2 states which produce 4 lines with weight of O.25.

4.3 Application

Of course it is not interesting to transform a single news in such a large table. But it is of high interest if we realize the agregation on a large number of news. Because in this case the number of combination of dimensions is limited and we can obtain a synthetic table called hypercube that summarize all the information extracted from the news in a relatively small object. The time of computation of an hypercube can be relatively large and the memory size necessary to the intermediary step of disagregation can be important, but the resulting object is small and very adapted for a large number of exploration and modelisation methods.

In practice, the function based on data.table package appears to be very fast and it appears interesting to build an hypercube for each of the topics of interest.

4.3.1 Pandemic hypercube

hc<-hypercube( corpus   = qd,
                      order    = "order",
                      who      = "who",
                      when     = "when",
                      timespan = "day",
                      what     = "pand",
                      where1   = "states",
                      where2   = "states")
  
  saveRDS(hc,"data/corpus/hc_mycorpus_states_pand.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_pand.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  0.384 Mo"

4.3.2 Quake hypercube

hc<-hypercube( corpus   = qd,
                      order    = "order",
                      who      = "who",
                      when     = "when",
                      timespan = "day",
                      what     = "pand",
                      where1   = "states",
                      where2   = "states")
  
  saveRDS(hc,"data/corpus/hc_mycorpus_states_quak.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_quak.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  0.384 Mo"

4.3.3 Mobility hypercube

hc<-hypercube( corpus   = qd,
                      order    = "order",
                      who      = "who",
                      when     = "when",
                      timespan = "day",
                      what     = "mobil",
                      where1   = "states",
                      where2   = "states")
  
  saveRDS(hc,"data/corpus/hc_mycorpus_states_mobil.RDS")
  paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_mobil.RDS")/1000000,3), "Mo")
[1] "Size of resulting file =  0.39 Mo"

5 Hypercubes exploration

#source("pgm/hypernews_functions_V6.R")

5.1 Objectives

The different dimensions of an hypercube can be analysed through different aggregation of the dimensions of the hypercubes, leading to different tables authorizing different modes of visualization. Each function is named according to the dimensions that are combined. Each function will produce two different outputs, a statistical table and an interactive graphic

5.1.1 Statistical table

Whatever the dimensions we decide to cross, we build a table where we realize a statistical test in order to identify the cells that are characterized by positive or negative outliers i.e. cells where the phenomena of interest (WHAT) is significantly more present or less present than usual. More precisely, the function will produce two for each cell of the cross dimension table :

  • a salience index (Xobs/Xest) : defined as the ratio between observed and estimated number of news where the topic is present.
  • an outlier index (prob (Xobs > Xest)) : defined as the probability that the number of news where the topic is present is significantly greater than expected.

In both cases we introduce two parameters of control that will limit the computation of indexes to the cells where it appears statistically relevant to realize the measure :

  • Minimum sample size (minsamp) : is the total number of news present in the cell before to compute the probability of apparition of the topic. The default value is equal to 20 as we consider as not meaningfull to compute a proportion on a smaller sample.

  • Minimum estimated value (mintest): is the threshold of computation of the chi-square test according to the estimated number of news where the topic is present. According to statistical rules of the chi-square test, this threshold should be equal to 5 for optimal conditions of application. The package R introduce indeed a warning message if the condition is not satisfied, which can increase the time of computation.

Of course, the user can decide to relax or reinforce these two conditions but it is normally better to avoid to do it. When conditions are not fulfilled, the graphic output will not display the cells where the indexes can not be computed.

The function that realize the test is the following one

#### ---------------- testchi2 ----------------
  #' @title  Compute the average salience of the topic and test significance of deviation
  #' @name what
  #' @description create a table and graphic of the topic
  #' @param tabtest a table with variable trial, success and null.value
  #' @param minsamp : Threshold of sample size requested for salience computation
  #' @param mintest : Threshold of estimated value requested for chi-square test
  
  
  testchi2<-function(tabtest=tabtest,
                     minsamp = 20,
                     mintest = 5) 
  {
    tab<-tabtest
    n<-dim(tab)[1]
    
    # Compute salience if sample size sufficient (default : N>20)
    tab$estimate <-NA
    tab$salience <-NA
    tab$chi2<-NA
    tab$p.value<-NA
    if (tab$trial > minsamp){ tab$estimate<-round(tab$success/tab$trial,5)
    tab$salience<-tab$estimate/tab$null.value
    
    # Chi-square test if estimated value sufficient (default : Nij* > 5)
    
    for (i in 1:n) {
      if(tab$trial[i]*tab$null.value[i]>=mintest) {  
        test<-prop.test(x=tab$success[i],n=tab$trial[i], p=tab$null.value[i], 
                        alternative = "greater")
        tab$chi2[i]<-round(test$statistic,2)
        tab$p.value[i]<-round(test$p.value,5)
      } 
    }
    }
    return(tab)
  }

5.1.2 Interactive graphic

Once the statistical table has been computed, the user can choose between two different visualizations, based on the salience index (exploration) or the chi-square test (ouliers detection). In both case the result will be an interactive figure realized in plotly where it is possible to click on each cell and have a look at the statistical parameters.

The user interested in static graphic (e.g. for publication) can easily adapt the program and realize new functions, for example in ggplot2.

In order to illustrate each type of graphic, we will choose the example of the topic of mobility without distinction between migrants and refugees.

hc <- readRDS("data/corpus/hc_mycorpus_states_mobil.RDS")

5.2 Topic frequence (What ?)

The first function has only one dimension and evaluate the proportion of news related to the topic. As a consequence, this function is not associated to a statistical test and return only a table and a graphic presenting the proportion of news where the topic is present or not.

5.2.1 Function

### ---------------- what ----------------
  #' @title  Compute the average salience of the topic
  #' @name what
  #' @description create a table and graphic of the topic
  #' @param hc an hypercube prepared as data.table
  #' @param subtop a subtag of the main tag (default = NA)
  #' @param title Title of the graphic
  
  
  what <- function (hc = hypercube,
                    subtop = NA,
                    title = "What ?")
  {
   
    
  tab<-hc
  if (is.na(subtop)){tab$what <-tab$what !="_no_"}else {tab$what <- tab$what == subtop}
  
  tab<-tab[,list(news = sum(news)),by = what]
  tab$pct<-100*tab$news/sum(tab$news)
  
  p <- plot_ly(tab,
               labels = ~what,
               values = ~pct,
               type = 'pie') %>%
    layout(title = title,
           xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
           yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
  
  output<-list("table" = tab, "plotly" =p)
  
  return(output)
  
  }

5.2.2 Example

res_what <- what(hc = hc,
               subtop = NA,
               title = "Topic news")
  res_what$table
    what   news       pct
  1: FALSE 112482 97.272476
  2:  TRUE   3154  2.727524
res_what$plotly

The table indicate that 3154 news was associated to the topic which represent 2.72% of the total.

5.3 Topic variation by media (who.what)

The function who.what explore the variation of interest for the topic in the different media of the corpus.

5.3.1 Function

#### ---------------- who.what ----------------
  #' @title  visualize variation of the topic between media
  #' @name who.what
  #' @description create a table of variation of the topic by media
  #' @param hc an hypercube prepared as data.table
  #' @param test : visualize test (TRUE) or salience (FALSE)
  #' @param minsamp : Threshold of sample size requested for salience computation
  #' @param mintest sample size of estimate for chi-square test (default = 5)
  #' @param title Title of the graphic
  
  
  who.what <- function (hc = hypercube,
                        test = FALSE,
                        minsamp = 20,
                        mintest = 5,
                        title = "Who says What ?")
  {
    
    tab<-hc
    {tab$what <-tab$what !="_no_"}
    
    tab<-tab[,list(trial = sum(news),success=round(sum(news*what),0)),by = list(who)]
    ref <-round(sum(tab$success)/sum(tab$trial),4)
    tab$null.value<-ref
    
    tab<-testchi2(tabtest=tab,
                  minsamp = minsamp,
                  mintest = mintest)
    
    
    
    if (test==FALSE) {tab$index =tab$salience
    tab<-tab[tab$trial > minsamp,]
    mycol<-brewer.pal(7,"YlOrRd")
    } 
    else {tab$index=tab$p.value
    tab<-tab[tab$trial*tab$null.value>mintest,]
    mycol<-brewer.pal(7,"RdYlBu")
    mycol[4]<-"lightyellow"
    }
    
    p <- plot_ly(tab,
                 x = ~who,
                 y = ~estimate*100,
                 color= ~index,
                 colors= mycol,
                 hoverinfo = "text",
                 text = ~paste('Source: ',who,
                               '<br /> Total news  : ', round(trial,0),
                               '<br /> Topic news : ', round(success,0),
                               '<br /> % observed  : ', round(estimate*100,2),'%',
                               '<br /> % estimated : ', round(null.value*100,2),'%',
                               '<br /> Salience : ', round(salience,2),  
                               '<br /> p.value : ', round(p.value,4)),
                 type = "bar")  %>%
      layout(title = title,
             yaxis = list(title = "% news"),
             barmode = 'stack')
    
    output<-list("table" = tab, "plotly" =p)
    
    return(output)
    
  }

5.3.2 Example

We present here the statistical table and the two types of graphics that can be produced. In the following case we will only present the outlier graphic.

res_who_what<- who.what(hc=hc, 
                          test = FALSE,
                          minsamp = 20,
                          mintest = 5,
                          title = "Topic news by media - Salience")
  
  kable(res_who_what$table)
who trial success null.value estimate salience chi2 p.value index
fr_FRA_libera 28827 607 0.0273 0.02106 0.7714286 42.08 1.00000 0.7714286
fr_DZA_elwata 11581 292 0.0273 0.02521 0.9234432 1.82 0.91137 0.9234432
fr_BEL_derheu 16358 349 0.0273 0.02134 0.7816850 21.69 1.00000 0.7816850
fr_FRA_figaro 22999 1005 0.0273 0.04370 1.6007326 232.26 0.00000 1.6007326
fr_BEL_lesoir 26057 652 0.0273 0.02502 0.9164835 5.01 0.98737 0.9164835
fr_DZA_xpress 9814 249 0.0273 0.02537 0.9293040 1.30 0.87310 0.9293040
res_who_what$plotly 
res_who_what<- who.what(hc=hc, 
                          test = TRUE,
                          minsamp = 5,
                          mintest = 1,
                          title = "Topic news by media - Significance")
  res_who_what$plotly

The analysis reveal a clear over-representation of the topic in the french newspaper Le Figaro (4.37% of news) as compared to the other media (2.1 to 2.5%).

5.4 Topic variation through time (when.what)

In this case we want to analyze if the topic has been more or less present at one period of time or another. It can therefore be interesting to modify the level of agregation before to do that and transform the initial hypercube (by day) toward another level of agregation. It is also possible to change the size of the time period as the outlier are defined by reference to the whole period of analysis

5.4.1 Function

#### ---------------- when.what ----------------
  #' @title  visualize variation of the topic through time
  #' @name when.what
  #' @description create a table of variation of the topic by media
  #' @param test : visualize test (TRUE) or salience (FALSE)
  #' @param minsamp : Threshold of sample size requested for salience computation
  #' @param mintest sample size of estimate for chi-square test (default = 5)
  #' @param title Title of the graphic
  
  
  when.what <- function (hc = hypercube,
                         test = FALSE,
                         minsamp = 20,
                         mintest = 5,
                         title = "Who says What ?")
  {
    
    tab<-hc
    {tab$what <-tab$what !="_no_"}
    
    tab<-tab[,list(trial = sum(news),success=round(sum(news*what),0)),by = list(when)]
    ref <-round(sum(tab$success)/sum(tab$trial),4)
    tab$null.value<-ref
    
    tab<-testchi2(tabtest=tab,
                  minsamp = minsamp,
                  mintest = mintest)
    
    if (test==FALSE) {tab$index =tab$salience
    tab<-tab[tab$trial > minsamp,]
    mycol<-brewer.pal(7,"YlOrRd")
    } 
    else {tab$index=tab$p.value
    tab<-tab[tab$trial*tab$null.value>mintest,]
    mycol<-brewer.pal(7,"RdYlBu")
    mycol[4]<-"lightyellow"
    }
    
    
    p <- plot_ly(tab,
                 x = ~as.character(when),
                 y = ~estimate*100,
                 color= ~index,
                 colors= mycol,
                 hoverinfo = "text",
                 text = ~paste('Time: ',when,
                               '<br /> Total news  : ', round(trial,0),
                               '<br /> Topic news : ', round(success,0),
                               '<br /> % observed  : ', round(estimate*100,2),'%',
                               '<br /> % estimated : ', round(null.value*100,2),'%',
                               '<br /> Salience : ', round(salience,2),  
                               '<br /> p.value : ', round(p.value,4)),
                 type = "bar")  %>%
      layout(title = title,
             yaxis = list(title = "% news"),
             barmode = 'stack')
    
    output<-list("table" = tab, "plotly" =p)
    
    return(output)
    
  }

5.4.2 Example 1 : 2014-2015 by month

# Modify time period by month
  hc2 <- hc %>% mutate(when = cut(when,breaks="month"))
  
  
  res_when_what<- when.what(hc=hc2, 
                            test=TRUE,
                            minsamp=10,
                            mintest=5,
                            title = "Topic news by month - Significance")
  
  
  res_when_what$plotly

The analysis reveals clear discontinuities in the timeline of the topic. We start with a low level (0.5 to 1.2%) from January 2014 to March 2015, followed by a brutal jump in April-June 2015 (3 to 5%) and a major peak in september 2015 (15.8% of news). At the end of the period, the level is clearly higher than at the beginning.

5.4.3 Example 2 : 2015 by week

# Modify time period by month
  hc2 <- hc %>% filter(substr(when,1,4)=="2015") %>% mutate(when = cut(when,breaks="week"))
  
  
  res_when_what<- when.what(hc=hc2, 
                            test=TRUE,
                            minsamp=10,
                            mintest=5,
                            title = "Topic news by week - Significance")
  
  
  res_when_what$plotly

5.4.4 Example 3 : August-Nov 2015 by day

# Modify time period by month
  hc2 <- hc %>% filter(when > as.Date("2015-07-31"), when < as.Date("2015-11-01")) 
  
  
  res_when_what<- when.what(hc=hc2, 
                            test=TRUE,
                            minsamp=10,
                            mintest=5,
                            title = "Topic news by day - Significance")
  
  
  res_when_what$plotly

5.5 Topic variation through space (where.what)

This function analyze the spatial distribution of places associated to the topic. As we have only collected states, we do not take into account the news where the topic of interest is associated to geographical area different from states (e.g. “migrants from subsaharan Africa”). But it is only a minority of cases and the fact to collect states make possible to produce easily a geographical map of the phenomena.

5.5.1 Function

#### ---------------- where.what ----------------
  #' @title  visualize spatialization of the topic 
  #' @name where.what
  #' @description create a table of variation of the topic by media
  #' @param hc an hypercube prepared as data.table
  #' @param test : visualize test (TRUE) or salience (FALSE)
  #' @param minsamp : Threshold of sample size requested for salience computation
  #' @param mintest sample size of estimate for chi-square test (default = 5)
  #' @param map a map with coordinates in lat-long
  #' @param proj a projection accepted by plotly
  #' @param title Title of the graphic
  
  
  where.what <- function (hc = hypercube,
                          test = FALSE,
                          minsamp = 20,
                          mintest = 5,
                          map = world_ctr,
                          proj = 'azimuthal equal area',
                          title = "Where said What ?")
  {
    
    tab<-hc
    tab$what <-tab$what !="_no_"
    
    tab<-tab[,list(trial = round(sum(news),0),success=round(sum(news*what),0)),by = list(where1)]
    ref <-round(sum(tab$success)/sum(tab$trial),4)
    tab$null.value<-ref
    
    tab<-testchi2(tabtest=tab,
                  minsamp = minsamp,
                  mintest = mintest)
    
    
    
    tab<-tab[order(-chi2),]
    
    
    
    if (test==FALSE) {tab$index =tab$salience
    tab<-tab[tab$trial > minsamp,]
    mycol<-brewer.pal(7,"YlOrRd")
    } 
    else {tab$index=tab$p.value
    tab<-tab[tab$trial*tab$null.value>mintest,]
    mycol<-brewer.pal(7,"RdYlBu")
    mycol[4]<-"lightyellow"
    }
    
    
    map<-merge(map,tab,all.x=T,all.y=F,by.x="ISO3",by.y="where1")
    
    
    
    #map2<-map[is.na(map$pct)==F,]
    #map2<-st_centroid(map2)
    #map2<-st_drop_geometry(map2)
    
    
    g <- list(showframe = TRUE,
              framecolor= toRGB("gray20"),
              coastlinecolor = toRGB("gray20"),
              showland = TRUE,
              landcolor = toRGB("gray50"),
              showcountries = TRUE,
              countrycolor = toRGB("white"),
              countrywidth = 0.2,
              projection = list(type = proj))
    
    
    
    p<- plot_geo(map)%>%
      add_markers(x = ~lon,
                  y = ~lat,
                  sizes = c(0, 250),
                  size = ~success,
                  #             color= ~signif,
                  color = ~index,
                  colors= mycol,
                  hoverinfo = "text",
                  text = ~paste('Location: ',NAME,
                                '<br /> Total news  : ', round(trial,0),
                                '<br /> Topic news : ', round(success,0),
                                '<br /> % observed  : ', round(estimate*100,2),'%',
                                '<br /> % estimated : ', round(null.value*100,2),'%',
                                '<br /> Salience : ', round(salience,2),  
                                '<br /> p.value : ', round(p.value,4))) %>%
      
      layout(geo = g,
             title = title)
    
    
    
    output<-list("table" = tab, "plotly" =p)
    
    return(output)
    
  }

5.5.2 Example

When we realize the map, we eliminate the news related to the topic where no countries has been mentioned. As a consequence the reference value is modified : in the whole sample 2.73% of news was related to the topic but in the sample of news where one country is mentioned 2.83% of the news are related the topic.

As the total number of news can be small in some countries, we have reduced here the parameters of the statistical test in order to visualize more countries on the map. It is therefore necessary to be cautious in the analysis of results.

map<-readRDS("data/dico_states/world_ctr_4326.Rdata")
  hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_")
  
  res_where_what<- where.what(hc=hc2,
                              test=TRUE,
                              minsamp=10,
                              map = map, 
                              mintest =2,
                              title = "Topic news by states - Significance")
  res_where_what$plotly

The analysis reveals that some countries are “specialized” in the topic during the period of observation. For example 53.5% of the news about Hungary was associated to the question of migrants and refugees, which is obviously related to the mediatization of the wall established by Viktor Orban in 2015. Other countries are characterized on the contrary by an under-representation of the topic like the USA where the topic is only associated to 0.7% of news. But the situation will change after Donald Trump’s election who will also establish a wall which will dramatically increase the number of news about USA and migrants.

5.6 Crossing 3 dimensions ?

It is of course possible to cross more than two dimensions but in this case it is necessary to have a larger sample of news because statistical tests are not possible when the numberof news is too small in the cells. The application TELEMAC, realized during the H2020 project OYCCEUS provides example of analysis combining who.where.what, who.when.what or where.when.what. It is based on a shiny interface with a lot of possibilities of interaction with the data that has not been developed here.

6 Spatial Networks

6.1 Introduction

This final section will focus on the spatial dimension of the analysis and more precisely on the spatial linkages that can be revealed by the news when two countries or more are present in the same sentence. Before to analyze real data, it is important to start with a very small theoretical example in order to explain more clearly what are these spatial linkages.

Consider the following set of 8 news published in 8 languages :

text1 <- "Russia annexes Crimea against Ukraine's advice"
  text2 <- "La France, l'Allemagne et les USA protestent contre l'intervention de l'armée russe en Ukraine"
  text3 <- "Russlands Intervention in Syrien wird von China unterstützt"
  text4 <- "Gli Stati Uniti si rifiutano di intervenire in Siria contro il parere della Francia "
  text5 <- "Trump está construyendo un muro entre Estados Unidos y México "
  text6 <- "Bangladeş, Myanmar'dan daha fazla mülteci kabul etmeyi reddediyor"
  text7 <- "פלסטינים ירו רקטות איראניות לעבר ישראל "
  text8 <-"ضحايا الفلسطينيون للهجوم الإسرائيلي المدعوم من الولايات المتحدة "
  text<-c(text1,text2,text3,text4,text5, text6,text7, text8)
  kable(text)
x
Russia annexes Crimea against Ukraine’s advice
La France, l’Allemagne et les USA protestent contre l’intervention de l’armée russe en Ukraine
Russlands Intervention in Syrien wird von China unterstützt
Gli Stati Uniti si rifiutano di intervenire in Siria contro il parere della Francia
Trump está construyendo un muro entre Estados Unidos y México
Bangladeş, Myanmar’dan daha fazla mülteci kabul etmeyi reddediyor
פלסטינים ירו רקטות איראניות לעבר ישראל
ضحايا الفلسطينيون للهجوم الإسرائيلي المدعوم من الولايات المتحدة

These news has been written in five different languages, but if we apply a method of geocoding where we try just to recognize the presence of states mentionned, we can produce three new sentences using the same language base on on ISO3 code of countries.

text1 <- "RUS UKR"
  text2 <- "FRA DEU USA RUS UKR"
  text3 <- "RUS SYR CHN"
  text4 <- "USA SYR FRA "
  text5 <- "USA MEX"
  text6 <- "BGD MMR"
  text7 <- "PSE ISR IRN"
  text8 <- "PSE ISR USA"
  
  text<-c(text1,text2,text3,text4,text5, text6, text7,text8)
  kable(text)
x
RUS UKR
FRA DEU USA RUS UKR
RUS SYR CHN
USA SYR FRA
USA MEX
BGD MMR
PSE ISR IRN
PSE ISR USA

Now imagine that we put this 8 sentences in quanteda (or any text mining package) as a corpus :

qd<-corpus(text)
  summary(qd)
Corpus consisting of 8 documents, showing 8 documents:

    Text Types Tokens Sentences
   text1     2      2         1
   text2     5      5         1
   text3     3      3         1
   text4     3      3         1
   text5     2      2         1
   text6     2      2         1
   text7     3      3         1
   text8     3      3         1

We can see that our ISO3 code has been identified as tokens which can be analyze as the element of a new language which is universal as long as we consider our dictionaries of state identification as correct.

It is now possible to use powerful tool of textual analysis which is the function fcm (feature co-occurence matrix) which compute all cases of co-occurence of two tokens in the same sentence.

geo_fcm<-fcm(qd, tri=FALSE)
  kable(as.matrix(geo_fcm))
RUS UKR FRA DEU USA SYR CHN MEX BGD MMR PSE ISR IRN
RUS 0 2 1 1 1 1 1 0 0 0 0 0 0
UKR 2 0 1 1 1 0 0 0 0 0 0 0 0
FRA 1 1 0 1 2 1 0 0 0 0 0 0 0
DEU 1 1 1 0 1 0 0 0 0 0 0 0 0
USA 1 1 2 1 0 1 0 1 0 0 1 1 0
SYR 1 0 1 0 1 0 1 0 0 0 0 0 0
CHN 1 0 0 0 0 1 0 0 0 0 0 0 0
MEX 0 0 0 0 1 0 0 0 0 0 0 0 0
BGD 0 0 0 0 0 0 0 0 0 1 0 0 0
MMR 0 0 0 0 0 0 0 0 1 0 0 0 0
PSE 0 0 0 0 1 0 0 0 0 0 0 2 1
ISR 0 0 0 0 1 0 0 0 0 0 2 0 1
IRN 0 0 0 0 0 0 0 0 0 0 1 1 0

We obtain a matrix which is in fact equivalent to a weighted network of co-coccurence of countries. A quick and dirty visualization of these object with a function of the package quanteda.textplots :

textplot_network(geo_fcm)

This example is sufficient to demonstrate the interest of geo network for the analysis of press, especially if we adopt a cross language perspective. The reader familiar with network analysis will easily find the tools adapted for the visualization and analysis of networks derived from the fcm matrix.

6.2 Geo networks modelisation

We propose here an example of modelisation of geo network that is specifically adapted to the hypercube structure that we have developed previously. What we propose here is not to identify the geographical units that are the most related but the geographical units that are more related than expected according to a random model. We have elaborated for that a pipeline of functions

6.2.1 Hypercube Filter (function)

This function is dedicated to the selection of some parts of an hypercube

hc_filter <- function(don = hc,
                        who = "who",
                        when = "when",
                        where1 = "where1",
                        where2 = "where2",
                        wgt = "tags",
                        self = FALSE,
                        when_start = NA,
                        when_end = NA,
                        who_exc = NA,
                        who_inc = NA,
                        where1_exc = NA,
                        where1_inc = NA,
                        where2_exc = NA,
                        where2_inc = NA)
  
    {                          
    
      df<-data.table(who = don[[who]],
                     when = don[[when]],
                     where1 = don[[where1]],
                     where2 = don[[where2]],
                     wgt = don[[wgt]])
      
      # Select time period
          if (is.na(when_start)==FALSE) { 
          df <- df[when >= as.Date(when_start), ]}
          if (is.na(when_end)==FALSE) { 
          df <- df[when <= as.Date(when_end), ]}
      # Select who
          if (is.na(who_exc)==FALSE) { 
          df <- df[!(who %in% who_exc), ]}
          if (is.na(who_inc)==FALSE) { 
          df <- df[(who %in% who_inc), ]}
      # Select where1
          if (is.na(where1_exc)==FALSE) { 
          df <- df[!(where1 %in% where1_exc), ]}
          if (is.na(where1_inc)==FALSE) { 
          df <- df[(where1 %in% where1_inc), ]}
      # Select where2
          if (is.na(where2_exc)==FALSE) { 
          df <- df[!(where2 %in% where2_exc), ]}
          if (is.na(where2_inc)==FALSE) { 
          df <- df[(where2 %in% where2_inc), ]}
      # eliminate internal links
         if (self==FALSE) { 
          df <- df[(where1 != where2), ]}
      return(df)
    
  }

6.2.2 Matrix builder (function)

This function build an interaction matrix (i, j, Fij) between states mentioned in the news. But offers the possibility to eliminate the spatial units that are not sufficiently present according to 5 parameters :

  • s1, s2 : minimum sum of links in lines and columns
  • n1,n2,k : minimum number of links of size greater than k in lines and columns

For example, the default parameters select only the units with a total sum of 10 and a minimum of 2 links greater than 1

build_int <- function(don = don,       # a dataframe with columns i, j , Fij
                        i = "where1",
                        j = "where2",
                        Fij = "wgt",
                        s1 = 10,
                        s2 = 10,
                        n1 = 2,
                        n2 = 2,
                        k = 1)
  
  {  
    df<-data.table(i=don[[i]],j=don[[j]],Fij=don[[Fij]])
    int <-df[,.(Fij=sum(Fij)),.(i,j)]
    int<-dcast(int,formula = i~j,fill = 0)
    mat<-as.matrix(int[,-1])
    row.names(mat)<-int$i
    mat<-mat[apply(mat,1,sum)>=s1,apply(mat,2,sum)>=s2 ]
    m0<-mat
    m0[m0<k]<-0
    m0[m0>=k]<-1
    mat<-mat[apply(m0,1,sum)>=n1,apply(m0,2,sum)>=n2 ]
    int<-reshape2::melt(mat)
    names(int) <-c("i","j","Fij")
    return(int)
  }

6.2.3 Random model (function)

This function solve the random model of allocation of flows (based on marginal sum) and provide as results :

  • Fij : observed flow
  • Eij : expected flow
  • Rabs_ij : absolute residual (Fij-Eij)
  • Rrel_ij : relative residual (Fij/Eij)
  • Rchi_ij : chi-square of the difference between Fij and Eij

It is not recommended to apply this function on a too large table and a parameter maxsize is introduced in order to lmimiit the risk of crash of the program.

rand_int <- function(int = int, # A table with columns i, j Fij
                       maxsize = 100000,
                       diag    = FALSE,
                       resid   = FALSE) {
      # Eliminate diagonal ?
      if (diag==FALSE) { 
          int <- int[as.character(int$i) != as.character(int$j), ]}
    
      # Compute model if size not too large
      if (dim(int)[1] < maxsize) {
         # Proceed to poisson regression model
         mod <- glm( formula = Fij ~ i + j,family = "poisson", data = int)
    
         # Add residuals if requested
         if(resid == TRUE)   { 
            # Add estimates
            int$Eij <- mod$fitted.values
  
            # Add absolute residuals
            int$Rabs_ij <- int$Fij-int$Eij
  
            # Add relative residuals
            int$Rrel_ij <- int$Fij/int$Eij
  
            # Add chi-square residuals
            int$Rchi_ij <-  (int$Rabs_ij)**2 / int$Eij
            int$Rchi_ij[int$Rabs_ij<0]<- -int$Rchi_ij[int$Rabs_ij<0]
            }
           
      } else { paste ("Table > 100000 -  \n 
                       modify maxsize =  parameter \n
                       if you are sure that your computer can do it !")}
    # Export results
    int$i<-as.character(int$i)
    int$j<-as.character(int$j)
    return(int)
    
   }

6.2.4 Visualize network (function)

This standard procedure is dedicated to the visualization of networls with a pleasant javascript interface. We do not explain here in details all parameters but notice that

  • size and minsize : defines the width of edges between nodes
  • test and mintest : defines the criteria of selection of edges. It will be generally the criteria of chi-square > 3.84 if we want to visualize non random associations with p < 0.05.
geo_network<- function(don = don,
                         from = "i",
                          to = "j", 
                          size = "Fij",
                          minsize = 1,
                          maxsize = NA,
                          test = "Fij",
                          mintest = 1,
                          loops  = FALSE, 
                          title = "Network")
  
  {
  int<-data.frame(i = as.character(don[,from]),
                  j = as.character(don[,to]),
                  size = don[,size],
                  test = don[,test]
                  )
  if (is.na(minsize)==FALSE) {int =int[int$size >= minsize,]} 
  if (is.na(maxsize)==FALSE) {int =int[int$size <= maxsize,]} 
  if (is.na(mintest)==FALSE) {int =int[int$test >= mintest,]}
  
  nodes<-data.frame(code = unique(c(int$i,int$j)))
  nodes$code<-as.character(nodes$code)
  nodes$id<-1:length(nodes$code)
  nodes$label<-nodes$code
  nodes$color <-"gray"
  nodes$color[nodes$code %in% int$j]<-"red"
  
  
  # Adjust edge codes
  edges <- int %>% mutate(width = 5+5*size / max(size)) %>%
                  left_join(nodes %>% select(i=code, from = id)) %>%  
                  left_join(nodes %>% select(j=code, to = id )) 
  
  # compute nodesize
  toti<-int %>% group_by(i) %>% summarize(size =sum(size)) %>% select (code=i,size)
  totj<-int %>% group_by(j) %>% summarize(size =sum(size)) %>% select (code=j,size)
  tot<-rbind(toti,totj)
  tot<-unique(tot)
  tot$code<-as.factor(tot$code)
  nodes <- left_join(nodes,tot) %>% mutate(value = 1 +5*sqrt(size/max(size)))
  
  
  #sel_nodes <-nodes %>% filter(code %in% unique(c(sel_edges$i,sel_edges$j)))
  
  # eliminate loops
  
  if(loops == FALSE) {edges <- edges[edges$from < edges$to,]}
  
  net<- visNetwork(nodes, 
                    edges, 
                    main = title,
  height = "1000px", 
                    width = "70%")   %>%   
     visNodes(scaling =list(min =20, max=60, 
                            label=list(min=20,max=80, 
                                      maxVisible = 20)))%>%
    visEdges(scaling = list(min=20,max=60))%>%
         visOptions(highlightNearest = TRUE,
       #               selectedBy = "group", 
      #               manipulation = TRUE,
                    nodesIdSelection = TRUE) %>%
          visInteraction(navigationButtons = TRUE) %>%
           visLegend() %>%
        visIgraphLayout(layout ="layout.fruchterman.reingold",smooth = TRUE)
  
  net
   return(net)
   } 

6.3 Application

We apply our different functions to the hypercube of mobility. For a more pleasant visualization, we replace firstly the ISO3 codes by the full name of the countries.

# Load complete hypercube
  hc <- readRDS("data/corpus/hc_mycorpus_states_mobil.RDS")
  hc <- hc %>% filter(where1 !="_no_", where2 != "_no_")
  
  # Eliminate non foreign news (french news for french newspapers ...)
  hc<-hc[hc$where1 != substr(who,4,6),]
  hc<-hc[hc$where2 != substr(who,4,6),]
  
  # Add complete labels
  map<-readRDS("data/dico_states/world_map_4326.Rdata")
  labs<-st_drop_geometry(map)
  labs<-labs[,c(1,4)]
  
  # Shorten the name of USA
  labs$NAME[labs$ISO3=="USA"]<-"U.S.A."
  
  
  names(labs)<-c("where1","geofr1")
  hc<-left_join(hc,labs)
  names(labs)<-c("where2","geofr2")
  hc<-left_join(hc,labs)
  
  hc_geo_geo <- hc

6.3.1 Reference network

We can firstly realize a network that combine all periods of time (when), all media (who) and all topics (what), which provide a reference for the next analysis.

hc<-hc_geo_geo 
  
  
  
  hc<-hc_filter(don = hc,
                               wgt = "news",
                               where1 = "geofr1",
                               where2 = "geofr2",
                               where1_exc = c("_no_"),
                               where2_exc = c("_no_"),
                               self = FALSE
                             )
  
  int <- build_int(don = hc,
                   s1=2,
                   s2=2,
                   n1=2,
                   n2=2,
                   k=0)
  
  mod<-rand_int(int,
                resid = TRUE,
                diag = FALSE)
  
  
  network<- geo_network(mod,
                        size = "Fij",
                        minsize = 10,
                        test = "Rchi_ij",
                        mintest = 3.84)
  network
  • Comment : The network is a combination of structural proximity (countries that are linked by relation not related to the period of observation) and event linkage related to specific events or crisis that has produced temporary connections generally not observed. The preferential linkage between Israël and Palestine is typically a structural linkage but the association between Liberia, Guinea and Sierra Leone has been induced by the Ebola outbreak. The strong linkage between Ukraine and Russia is a combination of the two types of effects.

6.3.2 Focus on a media

We can isolate a specific media like El Watan in order to examine to what extent his network is different fromthe reference network. But the parameters of selections has to be adapted because the size of the sample is smaller.

hc<-hc_geo_geo %>% filter(who == "fr_DZA_elwata")
  
  
  
  hc<-hc_filter(don = hc,
                               wgt = "news",
                               where1 = "geofr1",
                               where2 = "geofr2",
                               where1_exc = c("_no_"),
                               where2_exc = c("_no_"),
                               self = FALSE
                             )
  
  int <- build_int(don = hc,
                   s1=1,
                   s2=1,
                   n1=1,
                   n2=1,
                   k=0)
  
  mod<-rand_int(int,
                resid = TRUE,
                diag = FALSE)
  
  
  network<- geo_network(mod,
                        size = "Fij",
                        minsize = 2,
                        test = "Rchi_ij",
                        mintest = 3.84)
  network
  • Comments : The network of preferential associations of countries from El Watan is relatively fragmented because of the small number of news where two countries are mentionned. But it reveals nevertheless interesting associations anda clear focus on the Mediterranean countries. The sub-network with Morocco, Spain and Western Sahara is a specific concern of Algeria related to a long term conflict. The focus on the Israelo-Palestinian conflict is also very visible. The so-called “Migrant crisis” is not so visible as in French or Belgium newspapers.

6.3.3 Focus on a topic

We can also propose a focus on the countries that has been specifically associated in relation with a topic. For example, the topic of migrants and refugees. We are oblige to reduce one more time the different thresholds as the number of news is very limited, but the result is interesting.

hc<-hc_geo_geo %>% filter(what != "_no_")
  
  
  
  hc<-hc_filter(don = hc,
                               wgt = "news",
                               where1 = "geofr1",
                               where2 = "geofr2",
                               where1_exc = c("_no_"),
                               where2_exc = c("_no_"),
                               self = FALSE
                             )
  
  int <- build_int(don = hc,
                   s1=1,
                   s2=1,
                   n1=0,
                   n2=0,
                   k=0)
  
  mod<-rand_int(int,
                resid = TRUE,
                diag = FALSE)
  
  
  network<- geo_network(mod,
                        size = "Fij",
                        minsize = 1,
                        test = "Rchi_ij",
                        mintest = 3.84)
  network
  • Comments : The different components of the graph are related to different location of crisis. The major components are related to the “Syrian crisis” and the “Balkan Crisis” but we can also notice interesting dyads like Italy-Libya (boats sinkings in the straits of Sicile), France-UK (Calais), Myanmar-Bangladesh (Rohyngias), etc. The network is a fair mirror of the different locations or groups of locations where migrant and refugees was described as crossing border by the press.

Bibliographie

BARNIER, Julien, 2021. rmdformats: HTML Output Formats and Templates for ’rmarkdown’ Documents [en ligne]. S.l. : s.n. Disponible à l'adresse : https://github.com/juba/rmdformats.
R CORE TEAM, 2020. R: A Language and Environment for Statistical Computing [en ligne]. Vienna, Austria : R Foundation for Statistical Computing. Disponible à l'adresse : https://www.R-project.org/.
XIE, Yihui, 2020. knitr: A General-Purpose Package for Dynamic Report Generation in R [en ligne]. S.l. : s.n. Disponible à l'adresse : https://CRAN.R-project.org/package=knitr.

Annexes

Infos session

setting value
version R version 4.0.2 (2020-06-22)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui X11
language (EN)
collate fr_FR.UTF-8
ctype fr_FR.UTF-8
tz Europe/Paris
date 2021-12-04
package ondiskversion source
data.table 1.13.0 CRAN (R 4.0.2)
dplyr 1.0.2 CRAN (R 4.0.2)
ggplot2 3.3.3 CRAN (R 4.0.2)
knitr 1.34 CRAN (R 4.0.2)
lubridate 1.7.9.2 CRAN (R 4.0.2)
plotly 4.9.2.2 CRAN (R 4.0.2)
quanteda 3.0.0 CRAN (R 4.0.2)
quanteda.textplots 0.94 CRAN (R 4.0.2)
RColorBrewer 1.1.2 CRAN (R 4.0.2)
readr 1.4.0 CRAN (R 4.0.2)
readtext 0.80 CRAN (R 4.0.2)
rmarkdown 2.11 CRAN (R 4.0.2)
rzine 0.1.0 gitlab ()
sf 0.9.8 CRAN (R 4.0.2)
stringr 1.4.0 CRAN (R 4.0.2)
tidytext 0.2.6 CRAN (R 4.0.2)
visNetwork 2.0.9 CRAN (R 4.0.2)

Citation

@Manual{ficheRzine,
    title = {Titre de la fiche},
    author = {{Auteur.e.s}},
    organization = {Rzine},
    year = {202x},
    url = {http://rzine.fr/},
  }


Glossaire